
US006230153B1 



(12) United States Patent (10) Patent No.: US 6,230,153 B1

Howard et al. (45) Date of Patent: May 8, 2001



(54) ASSOCIATION RULE RANKER FOR WEB 
SITE EMULATION 

(75) Inventors: Steven Kenneth Howard, Irving, TX 
(US); David Charles Martin, San Jose; 
Mark Earl Paul Plutowski, Santa
Cruz, both of CA(US) 

(73) Assignee: International Business Machines 
Corporation, Armonk, NY (US) 

( * ) Notice: Subject to any disclaimer, the term of this 
patent is extended or adjusted under 35 
U.S.C. 154(b) by 0 days. 

(21) Appl. No.: 09/099,538 

(22) Filed: Jun. 18, 1998 

(51) Int. Cl.7 G06F 17/30

(52) U.S. Cl. 707/2; 707/200; 707/7; 705/5

(58) Field of Search 707/1-7, 200, 707/205; 705/5, 10

(56) References Cited 

U.S. PATENT DOCUMENTS 

5,615,341 3/1997 Agrawal et al 705/10 

5,668,988 * 9/1997 Chen et al 707/101 

5,724,573 * 3/1998 Agrawal et al 707/6 

5,832,482 * 11/1998 Yu et al 707/6 

6,061,682 * 5/2000 Agrawal et al 707/6 

OTHER PUBLICATIONS 

Han et al, "Discovery Of Multiple Level Association Rules From Large Databases", Proceedings of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, Sep. 11-15, 1995, pp. 420-431.

H. Mannila et al, "Improved Methods For Finding Association Rules", Pub. No. C-1993-65, 20 pages, Univ. Helsinki, 1993.

Savasere et al, "An Efficient Algorithm For Mining Association Rules In Large Databases", Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995, pp. 432-444.

Srikant et al, "Mining Generalized Association Rules", Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995, pp. 407-419.

Ullah, "Entropy, divergence and distance measures with econometric applications", Journal Of Statistical Planning And Inference, Elsevier 1993, pp. 137-163.

J.S. Park, et al, "Efficient Parallel Data Mining For Association Rules", IBM Research Report, RJ 20156, Aug. 1995.

J.S. Park et al, "An Effective Hash Based Algorithm For Mining Association Rules", Proc. ACM-SIGMOD Conf. On Management of Data, San Jose, May 1994.

Agrawal et al, "Parallel Mining Of Association Rules: Design, Implementation, And Experience", IEEE Transaction On Knowledge Data Engineering, vol. 8, No. 6, pp. 962-969, Dec. 1996.

Agrawal et al, "Fast Algorithms For Mining Association Rules", Proceedings of the 1994 VLDB Conferences, pp. 487-499, 1994.

Agrawal et al, "Mining Association Rules Between Sets of Items In Large Databases", Proc. 1993 ACM SIGMOD Conf. pp. 207-216, 1993.

Piatetsky-Shapiro, Chapter 13 "Discovery, Analysis, And Presentation Of Strong Rules", from Knowledge Discovery in Databases, pp. 229-248, AAAI/MIT Press, Menlo Park, Ca 1991.

Swami, "Research Report: Set-Oriented Mining For Association Rules", IBM Research Division, RJ 9567 (83573), Oct. 1993.

* cited by examiner 

Primary Examiner — Hosain T. Alam 

Assistant Examiner — Jean Bolte Fleurantin 

(74) Attorney, Agent, or Firm — Gray Cary Ware & 

Freidenrich LLP 



(57) 



ABSTRACT 



A method and apparatus that allows association rules defin- 
ing URL-URL relationships, and URL-URL relationships 
that are strongly influenced by a web site's topology, to be 
identified and respectively qualified. Superfluous associa- 
tion rules may be separated from non-topology affected 
association rules and discounted as desired. The invention 
may be implemented in conjunction with a probabilistic
generative method used to model a web site and simulate the 
behavior of a visitor traversing the site. The invention 
further allows randomized web site visitor behavior to be 
separated into "interesting" and "uninteresting" behavior. 

69 Claims, 8 Drawing Sheets 
ASSOCIATION RULE RANKER FOR WEB
SITE EMULATION 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to applying data mining 
association rules to sessionized web server log data. More 
particularly, the invention enhances data mining rule dis- 
covery as applied to log data by reducing large numbers of 
candidate rules to smaller rule sets.

2. Description of the Related Art

Traditionally, discovery of association rules for data mining applications has focused extensively on large databases comprising customer data. For example, association rules have been applied to databases consisting of "basket data" (items purchased by consumers and recorded using a bar-code reader) so that the purchasing habits of consumers can be discovered. This type of database analysis allows a retailer to know with some certainty whether a consumer who purchases a first set of items, or "itemset," can be expected to purchase a second itemset at the same time. This information can then be used to create more effective store displays, inventory controls, or marketing advertisements. However, these data mining techniques rely on randomness, that is, that a consumer is not restricted or directed in making a purchasing decision.

When applied to traditional data such as conventional consumer tendencies, the association rules used can be order-ranked by their strength and significance to identify interesting rules (i.e., relationships). But this type of sorting metric is less applicable to sessionized web site data because site-imposed associations exist within the data. Imposed associations may be constraints uniformly imposed on visitors to the web site. For example, to determine a relationship between site pages that web site visitors (visitors) find "interesting" using traditional data mining association rules, a researcher might look at pages that have strong link associations. However, for typical web site data, this type of association rule would probably be meaningless because of the site's inherent topology as discussed below.

Associations amongst web site pages, which are commonly identified by their respective uniform resource locators (URLs), exhibit behavior biased by at least two major effects: 1) the preferences and intentionality of the visitor; and 2) traffic flow constraints imposed on the visitor by the topology of the web site. Association rules used to uncover the preferences and intentionalities of visitors can be overwhelmed by the effects of the imposed constraints. The result is that a large number of "superfluous" rules (rules having high strength and significance yet essentially uninformative with respect to true visitor preferences) may be discovered. Commonly, these superfluous rules tend to be the least interesting to the researcher.

For example, association rules can be used to identify usage patterns of sessionized visits to a web site. Such rules deliver statements of the form "75% of visits from referrer A belong to segment B." Traffic flow patterns can also be uncovered in the form of statements such as "45% of visits to page A also visit page B." However, such rules that characterize behavior due to intentionality of the visitor will tend to be overwhelmed by rules that are due to the traffic flow patterns imposed upon the visitor by the site topology. Therefore, sorting these rules in the conventional manner will place high importance on rules of the form "100% of visitors that invoked URL A also visited URL B." When a visitor's conduct is dominated by the web site topology, rules emanating from such conduct need to be discounted.
Thresholding out the strongest associations between web
site pages is neither practical nor desirable, and manually 
wading through mined association rules for such associa- 
tions would be excruciatingly tedious and defeat the basic 
premise upon which data mining was developed. 

What is needed is a way to identify association rules that are strongly influenced by web site topology and are therefore considered uninteresting as association rules. Further, there is a need for the ability to eliminate superfluous association rules from sessionized web site log data and yet retain the superfluous rules for future use.

SUMMARY OF THE INVENTION 

Broadly, the present invention allows association rules
that are strongly influenced by a web site's topology to be 
identified. These superfluous association rules may be sepa- 
rated from non-topology affected association rules and dis- 
counted as desired. 

In one embodiment, the present invention is implemented 
in conjunction with a method to model a web site and 
simulate the behavior of a visitor traversing the site. The 
methods of the present invention are practiced upon the data 
generated by the generative model, also referred to as the 
Web Walker Emulator, and disclosed in U.S. Patent Application
entitled "WEB WALKER EMULATOR," by Steven
Howard et al., assigned to the assignee of the current 
invention, incorporated by reference herein and being filed 
concurrently herewith. The present invention allows ran- 
domized behavior within an emulated session to be reduced 
into "interesting" and "uninteresting" behavior. In another 
embodiment, the present invention may be practiced upon 
data accumulated from actual web site visits. 

In another embodiment, the invention may be implemented to provide a method to sort association rules by their relative empirical frequency (relevance), or support, within a database comprising URL data. This relevance ranking is dependent upon the URLs constituting a complete set of events, and ranks rules where the relevance of each data set is measured by comparing its associational support against the reference given by an emulated distribution. In another embodiment, rules within a set of rules may be compared. The degree of deviation of the relevance, or likelihood, of a rule is compared to a reference, such as the number 1, to determine peaks and lows. These peaks and lows are used to determine whether the behavior of actual users compares favorably with the behavior of emulated users. In another embodiment, these rules may be further sorted to determine point-by-point relevance information for rules that share a common likelihood ratio yet have different supports.
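The comparison against "a reference, such as the number 1" can be illustrated with a minimal sketch. It assumes each rule already has a support measured on actual sessions and a support measured on emulated sessions; the function name `rank_by_likelihood_ratio`, the sample numbers, and the choice to order rules by the absolute log of the ratio are illustrative assumptions, not details taken from the specification.

```python
import math

def rank_by_likelihood_ratio(actual_support, emulated_support):
    """Order rules by how far actual/emulated support deviates from 1.

    Both arguments map a rule, written as (antecedent, consequent), to a
    support in (0, 1]. Rules with a zero support on either side are skipped
    because the ratio is undefined; a real system would need a policy here.
    """
    ranked = []
    for rule, s_actual in actual_support.items():
        s_ref = emulated_support.get(rule, 0.0)
        if s_actual <= 0.0 or s_ref <= 0.0:
            continue
        ratio = s_actual / s_ref                      # ratio > 1 is a peak, < 1 a low
        ranked.append((rule, ratio, abs(math.log(ratio))))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked

# Hypothetical supports: F->E is forced by topology, so emulated visitors
# reproduce it exactly; E->D is far more common among actual visitors.
actual = {("F", "E"): 0.20, ("E", "D"): 0.10, ("A", "B"): 0.30}
emulated = {("F", "E"): 0.20, ("E", "D"): 0.01, ("A", "B"): 0.25}
ranking = rank_by_likelihood_ratio(actual, emulated)
for rule, ratio, deviation in ranking:
    print(rule, round(ratio, 2), round(deviation, 3))
```

In this toy data the topology-driven rule F→E has a ratio of exactly 1 and falls to the bottom of the ranking, while E→D, which emulated visitors rarely produce, rises to the top.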

In another embodiment, associations may be ranked even 
if the URLs comprise an incomplete system of events that 
may render an emulated choice non-mutually exclusive. In
this case, the events are converted into a probability distri- 
bution and sorted. In still another embodiment, the con- 
verted events may be sorted using more sensitive associa- 
tions to seek out rules that have unusual levels of support 
compared to a baseline reference distribution. In another 
embodiment, association rules may be ranked by their 
confidence to estimate these conditional probabilities. 

In still another embodiment, the invention may be imple- 
mented to provide an apparatus to sort association rules as 
described with regard to the various methods of the invention.
The apparatus may include a client computer interfaced with 
a server computer used to sort the associations. 

In still another embodiment, the invention may be implemented to provide an article of manufacture comprising a data storage device tangibly embodying a program of machine-readable instructions executable by a digital data processing apparatus to perform method steps for sorting association rules as described with regard to the various methods of the invention.

The invention affords its users a number of distinct advantages. One advantage is that the invention provides a way to avoid the necessity of storing massive amounts of historical URL data used to make future comparisons regarding the actions of a user traversing a web site. Another advantage is that the invention reduces the computational time required to process URL data and associations.

Further, the invention allows the evaluation of "emulated" 
events that did not actually occur, allowing future behavior 
of a web site user to be studied using these events.

BRIEF DESCRIPTION OF THE DRAWING 

The nature, objects, and advantages of the invention will become more apparent to those skilled in the art after considering the following detailed description in connection with the accompanying drawings, in which like reference numerals designate like parts throughout, wherein:

FIG. 1 is a block diagram of the hardware components and interconnections for discovering association rules in accordance with one embodiment of the invention;

FIG. 2 is a flowchart of an operational sequence to sort association rules in accordance with one embodiment of the invention;

FIG. 3 is a flowchart of an operational sequence to sort association rules in accordance with one embodiment of the invention;

FIG. 4 is a flowchart of an operational sequence to sort association rules in accordance with one embodiment of the invention;

FIG. 5 is a flowchart of an operational sequence to sort association rules in accordance with one embodiment of the invention;

FIG. 6 is a flowchart of an operational sequence to sort association rules in accordance with one embodiment of the invention;

FIG. 7 is a flowchart of an operational sequence for sorting association rules in accordance with the invention; and

FIG. 8 is a perspective view of an exemplary signal-bearing medium in accordance with the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention concerns discovering association rules in sessionized web server log data in the presence of constraints that may be expressed as Boolean expressions over the presence or absence of items in the database. Such constraints allow users to specify a subset of rules in which the users are interested. The constraints are integrated into an association rule discovery method instead of being performed in a post-processing step, thereby substantially reducing the time required to discover association rules.

The present invention includes various preferred methods for generating "candidate" itemsets and may be implemented in a broader sense as discussed in U.S. Pat. No. 5,615,341, Agrawal et al., for "SYSTEM AND METHOD FOR MINING GENERALIZED ASSOCIATION RULES IN DATABASES," assigned to the assignee of the current invention and incorporated herein by reference. Furthermore, the present invention may be used in conjunction with other methods using candidate generation, such as disclosed in Toivonen, "Sampling Large Databases for Association Rules," Proc. of the 22nd Int'l. Conf. on Very Large Databases (VLDB), Mumbai (Bombay), India, September 1996, and may be applied directly to the methods disclosed in Srikant et al., "Mining Generalized Association Rules," Proc. of the 21st Int'l. Conf. on Very Large Databases (VLDB), Zurich, Switzerland, September 1995. Each of the above references is also incorporated by reference herein.

To better understand the methods of the invention, a 
general statement of the relationships, nomenclature, and 
environment used to implement the various embodiments of 
the invention follows in sections A-E. Thereafter, the appa- 
ratuses, methods and signal bearing mediums of the present 
invention are described. 

A. Introduction 

A "session" is an ordered set of URLs associated with a 
particular visitor to a web site. A session tracks the "click- 
stream lifespan" of a visitor to a web site. "Sessionizing" 
web server log data involves splitting the data into mutually 
disjoint sessions. The click-stream lifespan of a session 
therefore consists of the sequence of URLs visited along the 
way during the session. 

Sessionized visits allow the invention to discover where 
visits come from, and where a user traversing the site tends 
to exit the site. For instance, given a set of "referrers" (sites 
from which a visit originates), and a set of candidate "exit 
pages" (URLs which may serve as the final URL visited 
during a session) the invention may evaluate the probability 
that a session originated from a particular referrer, as well as 
the probability that a session ends via a particular exit page. 
Possible associations between the two may be discovered by 
examining the probability that a session will end via a 
particular exit page, given that the session originated from a 
particular URL. In another example, the invention may
discover whether visitors to page A also tend to visit page B, 
where page A and B can be chosen from choices that are not 
mutually exclusive over the life of a session, and whereas 
each session has only a single referrer, entry page, and exit 
page. 
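As a rough illustration of the referrer and exit-page probabilities just described, the sketch below groups log records into sessions and estimates the probability of each exit page given the referrer. It assumes a simplified log of `(visitor_id, referrer, url)` tuples in time order with exactly one session per visitor; real sessionizing would typically also need timestamps and a timeout rule. The names `sessionize` and `exit_given_referrer` and the sample hosts are hypothetical.

```python
from collections import Counter, defaultdict

def sessionize(log_records):
    """Split time-ordered (visitor_id, referrer, url) records into sessions."""
    sessions = {}
    for visitor_id, referrer, url in log_records:
        session = sessions.setdefault(visitor_id, {"referrer": referrer, "urls": []})
        session["urls"].append(url)
    return list(sessions.values())

def exit_given_referrer(sessions):
    """Estimate P(exit page | referrer) from relative frequencies."""
    counts = defaultdict(Counter)
    for session in sessions:
        exit_page = session["urls"][-1]          # last URL visited in the session
        counts[session["referrer"]][exit_page] += 1
    return {
        referrer: {page: n / sum(c.values()) for page, n in c.items()}
        for referrer, c in counts.items()
    }

log = [
    ("v1", "search.example", "/home"),
    ("v2", "search.example", "/home"),
    ("v1", "search.example", "/products"),
    ("v3", "news.example", "/home"),
    ("v2", "search.example", "/checkout"),
    ("v1", "search.example", "/checkout"),
    ("v4", "search.example", "/home"),
]
sessions = sessionize(log)
probs = exit_given_referrer(sessions)
print(probs)
```

Here two of the three sessions referred by `search.example` end on `/checkout`, so the estimated conditional probability for that pair is 2/3.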

B. Definitions 

Let $U = \{u_1, u_2, \ldots, u_R\}$ be a table of URLs, and let $u \in U$. A session is $s = (s_1, s_2, \ldots, s_L)$, with $s_i \in U$, $i = 1, 2, \ldots, L$, for some finite integer $L = L(s) = \#s$. Therefore, a session is a sequence of URLs.

Further, $\Omega = \bigcup_{l=1}^{L} \times_{i=1}^{l} U$ where $L \le \infty$. For a finite session, L is finite. Observed sessions are a realization on the probability space $(\Omega, 2^{\Omega}, \mu)$ where $2^{\Omega}$ is the sigma algebra in $\Omega$ given by the set of all subsets of $\Omega$. An element $\omega$ of $\Omega$, where $\omega$ is a "sample point," gives a realization of a session, and an element of $2^{\Omega}$ (an "event") is a set of sessions. $\Omega$ can be thought of as an index table containing pointers into the set of all "permissible" sessions, where for convenience all sessions of up to length L are considered. If S is a random session from this set, then given a particular $\omega$ in $\Omega$, $s = S(\omega)$ denotes a realization of S. Therefore, if $A = \{\text{all sessions containing } u\}$ then the probability that a random session contains (i.e., "visits") u at least once is given by $\mu(A)$.

The probability that a random session S contains u may be denoted by $P(u \in S)$, or in a simple embodiment simply $P(u)$. Likewise, the probability that a session visits, for example, both $u_1$ and $u_2$ may be denoted by $P(u_1, u_2)$.
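These probabilities can be estimated from observed sessions by relative frequency. The following sketch assumes sessions are stored as tuples of URL names; the helper `visit_probability` is a hypothetical name used only for illustration.

```python
def visit_probability(sessions, *urls):
    """Fraction of sessions that contain every URL in `urls` at least once."""
    if not sessions:
        return 0.0
    hits = sum(1 for s in sessions if all(u in s for u in urls))
    return hits / len(sessions)

# Four toy sessions over the URL table {u1, u2, u3}.
sessions = [("u1", "u2", "u3"), ("u1", "u3"), ("u2",), ("u1", "u2", "u1")]
p_u1 = visit_probability(sessions, "u1")            # estimate of P(u1)
p_u1_u2 = visit_probability(sessions, "u1", "u2")   # estimate of P(u1, u2)
print(p_u1, p_u1_u2)
```

A repeated visit within one session (the last tuple visits u1 twice) counts once, matching the "at least once" wording of the definition.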

C. Association Rules and Web Sites 

Association rules find regularities between sets of items, for example, when an association rule A→B indicates that transactions of a database which contain A also contain B. Either the left hand side ("antecedent" or "head") or the right hand side ("consequent" or "body") can comprise multiple events. Rules of the form $u_1 u_2 \rightarrow u_3 u_4 u_5$ may be encountered. A rule A→B is defined as having a confidence c% over a set of sessions if c% of the sessions that contain A also contain B, and support s if s% of all sessions contain both A and B.

Efficient algorithms for finding association rules have been provided for mining large databases such as discussed in Agrawal et al., "Fast Discovery of Association Rules," Advances in Knowledge Discovery and Data Mining, Fayyad, U. M. et al. eds., AAAI Press/The MIT Press, Menlo Park, Calif., 1996. However, when applied to web server data, the problem arises that an abundant set of rules must be "distilled" to a manageable size. One way is to rank order rules according to measures of "relevance," "strength," or "importance." One measure of relevance is the support s. A useful measure of strength is given by the confidence c. Other candidate measures are the product of the two, such as cs, as well as c log c and s log s. In conventional transactional databases, these measures can be meaningful, as s measures the portion of transactions in which a rule is relevant, and c gives a direct measure of the associational strength.

However, when used to rank order rules over URLs gleaned from sessionized web server log data, ranking by confidence and support can yield poor results. This is the case when association rules are used to analyze traffic flow patterns of visits to a site, and then those traffic flow patterns are used to infer regularities about the preferences and intentionality of the visitors. Association rules based on confidence and support detect regularities in traffic flow regardless of whether they are due to intentionality on the part of the visitor, or due to forced paths imposed upon the visitor by the web site structure. A rule with substantial support s and strong confidence c can be uninteresting. This follows from what we know about the web site construction, because essentially all visits are subject to certain traffic flow constraints that may provide little option for choice.

A particular example is given by an entry form. To view the entry form, one must visit URL E. This is a matter of choice as not all visitors must view the form. To submit the form, one must visit URL F. This is also a matter of choice, because not all visitors that view the form must submit it. However, it is unsurprising that the rule F→E will have confidence of 100% if all visitors that submit the form must also view the form because this association is not a matter of choice: the two necessarily occur together if F occurs at all.

Another scenario may arise where page A on a given web site has links to pages B, C, and D, and where these three pages are not accessible from links off any other page other than A on this site. Then rules B→A, C→A, and D→A may have confidences of 100% for the same reason: namely, traffic flow constraints impose this regularity. On the other hand, consider the rules A→B, A→C, and A→D. Furthermore, assume that page A has no other links. If these rules have confidences of 33%, 33%, and 34% respectively, it indicates a very balanced distribution of traffic across these three links. This fact might be interesting to the administrator of the site, or even to the web architect whose job it is to arrange the content on the site to suit the visitors' preferences. On the other hand, it may be less interesting to those most interested in traffic flowing to page D. Although it receives slightly more traffic than the other two pages, the traffic flow it receives is not much more than that which can be explained by random choice, for example, for visitors making completely random choices at each decision point. Alternatively, it might be of substantial interest if these confidences were instead 5%, 5%, and 90%. It might be even more interesting where E→D, where E is a page that does not have direct links to either A or D, yet rule E→D has confidence of 10%. Although 10% may seem like a low number relative to the examples just considered, this level of confidence may actually be striking if it is due to apparent strong mutual interest in both E and D even though the two pages are not directly accessible from each other.

Currently, eliminating these types of problems by direct analysis of a web site's topology is either impractical or entirely unachievable. For example, graph connectivity analysis alone does not suffice, because solving this problem requires knowledge of the routing between traversal links. In actual web server logs, the situation is complicated by the fact that pages tend to be accessible in multiple ways, and that links can appear on multiple pages. Furthermore, pages can be created dynamically depending upon the attributes of the visitor. Because page content can determine the link traversal topology, web site topology itself can therefore be dynamic.

D. The "Web Walker Emulator"

The Web Walker Emulator incorporated by reference above may be used to implement the methods of the present invention. In one embodiment, the Web Walker Emulator is a method for creating a probabilistic generative model of a web site that simulates the behavior of visitors traversing through the site. This simulation "emulates" the behavior of actual visitors to a web site. The parameterization of the simulation can be adjusted in one embodiment such that these "emulated" visitors display behavior that is substantially indistinguishable from that of actual users (or a subset thereof) with respect to population statistics observed over their respective traffic patterns. Or, in another embodiment, it can be tuned to display hypothetical behavior such as visitors acting without evidence of intentional choice. Tracking the site usage traffic of emulated visitors may yield a set of reference distributions ("emulated distributions") against which may be compared the site usage distributions obtained for actual users. The emulated distributions are used to implement estimation methods which measure relative information content. The Kullback-Leibler Information Criterion and the Bayesian criteria, widely known to those schooled in the art, are two such estimation methods. The result is a set of reference distributions against which the distributions obtained for actual users may be compared.

E. Applying Emulated Distributions

A set of session logs derived from actual visits to a site generally provides the basis for a set of distributions that describe the behavior of those visits. In particular, these distributions describe behavior that is visible from the web server. If a distribution based upon behavior that is unobservable to the web server is obtained, such a distribution may embody behaviors that are known to exist but are unobservable, or purely hypothetical. However, the availability of such a distribution allows differences between arbitrary distributions to be discovered. This is useful in cases where conventional statistics are unsatisfactory. For instance, conventional statistical analysis of "significance," or of associational "strength" and "relevance," implicitly assumes that the reference distribution is a uniform distribution, that is, where sample points are equally likely under the same hypothesis. In certain applications such statistics are at best unsatisfactory or at worst misleading, because the preferable null hypothesis is one where the sample points are drawn from a distribution with different yet known qualities. In one embodiment, the present invention allows randomized behavior within an emulated session to result in highly structured behavior that is "significant" in the usual statistical sense.
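The point that purely random behavior can still yield "significant" rules can be illustrated with a small sketch. It computes support and confidence as defined in section C and applies them to sessions produced by a uniformly random walker on the page A/B/C/D topology discussed above. The walker is only a stand-in written for this illustration; it is not the Web Walker Emulator of the referenced application, and the names `emulate_sessions` and `links`, the extra entry page `H`, and the rule that every step may also leave the site are all assumptions.

```python
import random

def support(sessions, antecedent, consequent):
    """Fraction of all sessions that contain both pages."""
    both = sum(1 for s in sessions if antecedent in s and consequent in s)
    return both / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing the antecedent that also contain the consequent."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(1 for s in with_antecedent if consequent in s) / len(with_antecedent)

def emulate_sessions(links, entry_pages, n_sessions, max_len, seed=0):
    """Random walker: at each page, pick uniformly among outgoing links or exit."""
    rng = random.Random(seed)
    sessions = []
    for _ in range(n_sessions):
        page = rng.choice(entry_pages)
        session = [page]
        while len(session) < max_len:
            page = rng.choice(links.get(page, []) + [None])  # None means leave the site
            if page is None:
                break
            session.append(page)
        sessions.append(session)
    return sessions

# B, C and D are reachable only from A; visitors enter at H or A.
links = {"H": ["A"], "A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
emulated = emulate_sessions(links, entry_pages=["H", "A"], n_sessions=2000, max_len=10)

print("confidence B->A:", confidence(emulated, "B", "A"))
print("confidence A->B:", confidence(emulated, "A", "B"))
print("support    B->A:", support(emulated, "B", "A"))
```

Because B can only be reached through A, the rule B→A has 100% confidence even though the walker makes no intentional choices, which is the kind of topology-imposed regularity a reference distribution is meant to expose.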

In the present invention, a reference distribution allows powerful and general-purpose information theoretic statistics to be applied as discussed below for extracting information from a distribution of interest. The Kullback-Leibler Information Criterion (KLIC) mentioned above is one such method that can be used by the present invention for discriminating between distributions. In particular, it measures the directional divergence between two distributions, meaning that the measure is not symmetric. Although it is not a distance measure, it is sometimes referred to as the "KL-distance." It is also easy to construct a variation of the KLIC that yields a non-directional pseudo-distance measure (cf. [Ullah, A., "Entropy, Divergence and Distance Measures with Econometric Applications," Working Paper in Economics, Department of Economics, University of California, Riverside, Riverside, Calif., Journal of Statistical Planning and Inference, 49:137-162, 1996]). For background on the KLIC see White, H., "Parametric Statistical Estimation with Artificial Neural Networks: A Condensed Discussion," From Statistics to Neural Networks: Theory and Pattern Recognition Application, V. Cherkassky, J. H. Friedman and H. Wechsler eds., 1994 and White, H., "Parametric Statistical Estimation with Artificial Neural Networks," P. Smolensky, M. C. Mozer and D. E. Rumelhart eds., Mathematical Perspectives on Neural Networks, L. Erlbaum Associates (to appear), Hillsdale, N.J., 1995. For an elegant and concise overview of distributional information measures in general, see Ullah, A., 1996, supra. A brief introduction of KLIC is provided below.
1. Relative Entropy 

Let P and R be two candidate session generating processes (i.e., probability measures) over the set of permissible sessions, an index into which we have denoted $\Omega$. More precisely, P and R are probability measures on $(\Omega, 2^{\Omega})$. We wish to determine which of P and R is responsible for generating a given (realization of a) session $s = S(\omega)$ where $\omega \in \Omega$.

Let Q be a probability measure that dominates both P and R. This means that for each permissible set of sessions A, $Q(A) = 0$ implies $P(A) = 0$ and $R(A) = 0$. In some practical circumstances Q may equal R. Let $\rho_P = dP/dQ$ and $\rho_R = dR/dQ$, representing associated (Radon-Nikodym) type density functions. Following Kullback and Leibler, the log density ratio $\log[\rho_P(\omega)/\rho_R(\omega)]$ is taken as the information in $\omega$ for discriminating between P and R. This quantity is known as the log likelihood ratio and may be optimal in a variety of senses for discriminating between P and R. The expected value of the log likelihood ratio yields the KLIC:

$$I(\rho_P : \rho_R) = \int_{\Omega} \log\left[\frac{\rho_P(\omega)}{\rho_R(\omega)}\right] \rho_P(\omega)\, dQ(\omega)$$

If Web sessions are generated by P, then the KLIC quantifies the information theoretic measure of "surprise" experienced on average when sessions are generated by P but are described by R. When the intersection of the supports of R and Q is nonempty and the integral is taken over a finite space, this simplifies to:

$$I(\rho_P : \rho_R) = \sum_{s \in \mathrm{supp}(R)} \rho_P(s) \log\left(\frac{\rho_P(s)}{\rho_R(s)}\right)$$

and is commonly referred to as "cross-entropy" or "relative entropy." Accordingly, let $K(P:R) = I(\rho_P : \rho_R)$.

Reference may also be made to the following two quantities:

$$\frac{\rho_P(s)}{\rho_R(s)}$$

and

$$\rho_P(s) \log\left(\frac{\rho_P(s)}{\rho_R(s)}\right).$$

The former is the likelihood ratio. The latter is the information in s for discriminating between P and R.
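For a finite set of outcomes, the relative entropy K(P:R) reduces to a short sum. The sketch below is a minimal illustration under the assumption that both distributions are given as dictionaries mapping an outcome to its probability; the function name `klic` and the sample numbers (borrowed from the 5%, 5%, 90% versus balanced-traffic example in section C) are illustrative only.

```python
import math

def klic(p, r):
    """K(P:R) = sum over s of p(s) * log(p(s) / r(s)), using natural logs.

    Terms with p(s) == 0 contribute nothing. If P puts mass on an outcome
    that R rules out, the divergence is infinite.
    """
    total = 0.0
    for s, p_s in p.items():
        if p_s == 0.0:
            continue
        r_s = r.get(s, 0.0)
        if r_s == 0.0:
            return math.inf
        total += p_s * math.log(p_s / r_s)
    return total

# Actual traffic out of page A versus an emulated, evenly balanced reference.
actual = {"A->B": 0.05, "A->C": 0.05, "A->D": 0.90}
emulated = {"A->B": 1 / 3, "A->C": 1 / 3, "A->D": 1 / 3}

print("K(actual:emulated) =", klic(actual, emulated))
print("K(emulated:actual) =", klic(emulated, actual))
print("K(emulated:emulated) =", klic(emulated, emulated))
```

The two directions give different values, which reflects the directional (non-symmetric) nature of the measure, and the divergence of a distribution from itself is zero.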
2. Information and Entropy 

Several forms of information and entropy exist. Information and entropy are closely related, as entropy is "minus" the information [Ullah, A., 1996, "Entropy, Divergence and Distance Measures with Econometric Applications," Working paper in Economics, Department of Economics, University of California, Riverside, USA 92521, J. of Statistical Planning and Inference, 49:137-162.]. KLIC generalizes the notion of entropy. Shannon-Wiener entropy is a special case of the KLIC that arises when R dominates P [White, H., 1994, supra]. Entropy measures the "uncertainty" of a single distribution (cf. [Khinchin, A., Mathematical Foundations of Information Theory, Dover Publications, Inc., NY, 1957], [Ullah, A., 1996, supra]).

To illustrate the difference between Shannon-Wiener 
entropy and the KLIC, consider for example a finite prob- 
ability scheme. When applied to a "complete system" of n 
events (a mutually exclusive set of which one and only one 
must occur at each trial) uncertainty is maximized when all 
events are equally likely (giving each a pointwise probability
or "density" of $n^{-1}$). Furthermore, given that all events
are equally likely, uncertainty increases with the number of 
events n. For a finite system, uncertainty is minimized when 
all likelihood concentrates on a single point, in which case 
the entropy is zero. 

By comparison, the KLIC is a relative measure of infor- 
mation available for distinguishing a target distribution from 
a reference distribution. It's absolute value is minimized 
when the target is indistinguishable from the reference — in 
this case knowing the reference implies knowing the target. 
For a finite system in which each event is equally likely 
under the reference distribution, the KLIC is equal to minus 
the Shannon- Wiener entropy (discussed above) of the target 
distribution plus a constant. In the preferred embodiment, 
the present invention requires KLIC (relative entropy) 
because traditional entropy measure methods rank superflu- 
ous association rules highly, exactly the problem that the 
present invention addresses. One reason superfluous asso- 
ciation rules may be highly ranked is because even "ran- 
domized" visitor behavior can be highly structured — 
therefore, have low entropy — due to traffic flow constraints 
imposed by the web site topology.
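
The relationship between the KLIC and Shannon-Wiener entropy for an equally likely reference can be sketched as follows. This is an illustrative example only, assuming the natural logarithm and a reference that is strictly positive wherever the target is positive; the function names and the example target distribution are hypothetical.

```python
import math

def shannon_entropy(p):
    """Entropy of a single distribution, in nats."""
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0)

def relative_entropy(p, r):
    """KLIC K(P:R) = sum_i p_i * log(p_i / r_i), in nats."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

target = [0.7, 0.1, 0.1, 0.1]
n = len(target)
uniform_reference = [1.0 / n] * n   # every event equally likely

klic = relative_entropy(target, uniform_reference)
# With a uniform reference, the KLIC is minus the entropy of the
# target plus the constant log(n).
print(klic, math.log(n) - shannon_entropy(target))
# The KLIC vanishes when the target coincides with the reference.
print(relative_entropy(target, target))
```

Unlike entropy, the first quantity changes whenever the reference changes, which is what allows a structured but uninteresting reference (such as emulated traffic) to be discounted.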

For additional background on the KLIC, White, H., 1994, 
supra and White, H., 1995, supra, may be consulted and for 
a concise yet comprehensive survey of KLIC compared and 
contrasted with other information measures (including 
Shannon-Wiener information and mutual information) see 
Ullah, A., 1996, supra. 

The above discussion relating to definitions used to 
explain the methods of the invention and the environment in 
which the methods may be practiced should be particularly 
helpful in understanding how the methods are implemented, 
and the hardware associated therewith. 

Hardware Components & Interconnections 

One aspect of the invention concerns an apparatus for 
extracting desired data relationships from a web site 
database, which may be embodied by various hardware
components and interconnections as described in FIG. 1. 

Referring to FIG. 1, a data processing apparatus 100 for 
analyzing databases for generalized association rules is 
illustrated. In the architecture shown, the apparatus 100 
includes one or more digital processing devices, such as a 
client computer 102 having a processor 103 and a server 
computer 104. In one embodiment, the server computer 104 
may be a mainframe computer manufactured by the Inter- 
national Business Machines Corporation of Armonk, N.Y., 
and may use an operating system sold under trademarks 
such as MVS. Or, the server computer 104 may be a Unix 
computer, or OS/2 server, or Windows NT server, or IBM 
RS/6000 530 workstation with 128 MB of main memory 
running AIX 3.2.5. The server computer 104 may incorpo- 
rate a database system, such as DB2 or ORACLE, or it may 
access data on files stored on a data storage medium such as 
disk, e.g., a 2 GB SCSI 3.5" drive, or tape. Other computers, 
servers, computer architectures, or database systems than 
those discussed may be employed. For example, the func- 
tions of the client computer 102 may be incorporated into the 
server computer 104, and vice versa. 

FIG. 1 shows that, through appropriate data access pro- 
grams and utilities 108, a mining kernel 106 accesses one or 
more databases 110 and/or flat files (i.e., text files) 112 
which contain data chronicling transactions. After executing 
the steps described below, the mining kernel 106 outputs 
association rules it discovers to a mining results repository 
114, which can be accessed by the client computer 102. 

Additionally, FIG. 1 shows that the client computer 102 
can include a mining kernel interface 116 which, like the 
mining kernel 106, may be implemented in suitable com- 
puter code. Among other things, the interface 116 functions 
as an input mechanism for establishing certain variables, 
including the minimum support value or minimum confi- 
dence value. Further, the client computer 102 preferably 
includes an output module 118 for outputting/displaying the 
mining results on a graphic display 120, print mechanism 
122, or data storage medium 124. 

Operation 

In addition to the various hardware embodiments 
described above, a different aspect of the invention concerns 
a method for applying association rules to sessionalized web 
server log data. Throughout the following description, a 
given set of sessions from which web server log data is 
gathered may be treated as a realization of a random variable 
drawn from a stationary data generating process. 

One use of an emulated distribution is to simulate the 
behavior of actual visitors. The procedure may comprise in 
one embodiment parameterizing the Web Walk Emulator to 
closely match the behaviors of actual visitors and fine-tuning 
these parameters to minimize the relative entropy divergence
between the emulated and actual distributions by way of, for
example, local gradient optimization or by global
optimization over a computational grid laid down over 
parameter space, or by a combination of global and local 
search. The resulting optimized parameterization can be 
used to generate the population statistics exhibited by the 
original visitors. 
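
The global search over a computational grid mentioned above can be sketched as follows. This is a toy illustration only: the one-parameter family `toy_emulated_distribution` is a hypothetical stand-in for emulator output and does not represent the actual parameters of the Web Walk Emulator, and the "actual" distribution is synthesized for the example.

```python
import math

def relative_entropy(p, r):
    """K(P:R) in nats; assumes r_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

def toy_emulated_distribution(theta, n_pages=5):
    """Hypothetical emulator output: page k receives weight theta**k."""
    weights = [theta ** k for k in range(n_pages)]
    total = sum(weights)
    return [w / total for w in weights]

def grid_search(actual, grid):
    """Return the grid point whose emulated distribution is closest to `actual`."""
    return min(
        grid,
        key=lambda t: relative_entropy(actual, toy_emulated_distribution(t)),
    )

# Pretend this distribution was measured from actual visitor sessions.
actual = toy_emulated_distribution(0.6)
grid = [i / 100 for i in range(1, 100)]   # 0.01, 0.02, ..., 0.99
best_theta = grid_search(actual, grid)
print(best_theta)
```

A local gradient step could then refine `best_theta` between grid points, which corresponds to the combination of global and local search described above.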

Consider the task of comparing user behavior from his- 
torical data with current day behavior. As a simple means of 
accomplishing this comparison, historical data can be saved 
and used for future comparisons. However, this approach 
has several drawbacks: 

1. The amount of data can be massive, requiring excessive
storage space.
2. Even if storage space is not an expensive resource,
processing such a large amount of data may be more
expensive than is necessary, especially if the data is
highly redundant. One solution is to compress the data
into a set of sufficient statistics and then save only that
compressed data set.
3. While compressing a store of data into a set of sufficient
statistics can be desirable, anticipating what statistics to
calculate for future reference can be difficult, especially
where subjective preferences change over time: what is
considered important today may not be interesting
tomorrow. Conversely, information considered discardable
today may become useful tomorrow.

Having "emulated users" with the same behavioral characteristics
as historical users allows us to evaluate an arbitrary
set of statistics at a later time, including statistics that
were invented after the historical data was observed. It is
possible to create hypothetical situations that were not
presented to the historical users, and computationally "imagine"
what behaviors historical users might have exhibited if
subjected to the hypothetical set of choices. In the present
invention, "emulation" combined with "simulation" allows
hypothetical situations to be considered, such as, "how
would last year's users react to this year's web site structure?"
This behavior can then be used as a reference distribution
for comparison against this year's behavior. The
following discussion discloses the methods of the present 
invention that may be used by the Web Walk Emulator for 
detecting meaningful URL-URL associations. 

Overall Sequence of Operation

FIGS. 2-7 show several methods to illustrate examples of 
the various method embodiments of the present invention. 
For ease of explanation, but without any limitation intended 
thereby, the examples of FIGS. 2-7 are described in the
context of the apparatus 100 described above.

1. Ranking Association Rules by Support

a. Sets of Rules over a Complete System of Events
In the present invention, a complete system of events may
be a mutually exclusive set of which one and only one must
occur at each trial. Association rules can be applied to
measure the strength of association between an event A and
a set of options that comprise a complete system, say one
having n events such as $B = (B_1, B_2, \ldots, B_n)$. This generates
n association rules: $A \rightarrow B_1, A \rightarrow B_2, \ldots, A \rightarrow B_n$. One distribution
of interest is given by examining the "support" of
each rule. Let P give the target probability distribution for
actual visits contained in a database of step 204 shown in
FIG. 2. The "support" of rule $A \rightarrow B_i$ over the set of sessions
observed from actual visits (actual support) measures a
specific quantity, defined above with respect to Association
Rules, that is directly observable in empirical samples.
Although strictly speaking "support" measures an empirical
relative frequency, given a sufficient amount of data and a
sufficiently "stationary" data generating process, it may be
used to estimate the generally unobservable $P(A, B_i)$.
Therefore, the support s may be used as an estimate of the
probability $P(A, B_i)$ that defines the data generating process
underlying the original data set. For clarity, probabilistic
quantities (such as $P(A, B_i)$) rather than their estimates s
will be discussed. However, the following holds true if s is
used instead.
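
As an illustration of how the support s may serve as an empirical estimate of $P(A, B_i)$, the following minimal sketch counts sessions containing both the antecedent and the consequent. The session data, the URL names, and the function name `support` are hypothetical, and each session is assumed to be represented simply as the set of URLs visited.

```python
def support(sessions, antecedent, consequent):
    """Fraction of sessions that contain both the antecedent and the consequent."""
    hits = sum(1 for s in sessions if antecedent in s and consequent in s)
    return hits / len(sessions)

# Hypothetical sessionized log: each session is the set of URLs visited.
sessions = [
    {"/home", "/products"},
    {"/home", "/support"},
    {"/home", "/products"},
    {"/about"},
]

options = ["/products", "/support"]   # the options B_1, B_2 following A = "/home"
estimates = {b: support(sessions, "/home", b) for b in options}
print(estimates)
```

With more sessions and a sufficiently stationary process, these relative frequencies would be expected to approach the underlying probabilities.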

A rule such as $A \rightarrow B_i$ can be evaluated over different realizations
of the same type of data, such as that produced by
different realizations (e.g., as provided by a generative
model such as the Web Walk Emulator, or, as observed from
the same web site over a different time span) and stored in
the database of step 204. If $R(A, B_i)$ gives the support of this
rule as measured over emulated visits (henceforth we refer
to it as "emulated support") in step 206, two probability
distributions over n events are considered that can be
compared via relative entropy, namely,
$P_{AB} = P(A, B_1), P(A, B_2), \ldots, P(A, B_n)$ and
$R_{AB} = R(A, B_1), R(A, B_2), \ldots, R(A, B_n)$.
Further, $K(P_{AB} : R_{AB}) = 0$, where K is some constant value, if
and only if these distributions are identical. One way to
apply this is to compare $P_{AB}$ with a different set of association
rules, say $P_{AC} = P(A, C_1), P(A, C_2), \ldots, P(A, C_m)$, for some
integer m and complete system of events $C = (C_1, C_2, \ldots, C_m)$,
by computing $K(P_{AC} : R_{AC})$ and comparing with
$K(P_{AB} : R_{AB})$ in step 208. If $K(P_{AC} : R_{AC}) > K(P_{AB} : R_{AB})$, then
association rules applied to the system of events C have
higher relevance on average (as compared against the backdrop
of the reference $R_{AC}$) than that observed for rules over
B (as compared against the backdrop of the reference $R_{AB}$).
The method ends in step 210.
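
The comparison of step 208 can be sketched as follows. This is an illustrative example only: the support values are hypothetical, and the sketch assumes that each vector of supports is first normalized to sum to one so that it can be treated as a probability distribution, which is a choice made for the example rather than a stated requirement.

```python
import math

def normalize(values):
    """Rescale non-negative supports so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]

def relative_entropy(p, r):
    """K(P:R) in nats; assumes r_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

# Hypothetical supports for rules A->B_i and A->C_j (actual P, emulated R).
p_ab, r_ab = [0.20, 0.20, 0.10], [0.19, 0.21, 0.10]
p_ac, r_ac = [0.30, 0.05, 0.15], [0.10, 0.20, 0.20]

k_ab = relative_entropy(normalize(p_ab), normalize(r_ab))
k_ac = relative_entropy(normalize(p_ac), normalize(r_ac))

# The system whose actual supports diverge more from the emulated
# reference is treated as carrying the more relevant rules on average.
more_relevant = "C" if k_ac > k_ab else "B"
print(k_ab, k_ac, more_relevant)
```

In this example the rules over B behave almost exactly as emulated, so the rules over C are the ones singled out.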

b. Individual Rules over a Complete System of Events 
The ranking method discussed above with respect to FIG. 
2 compares the relevance of two sets of rules in which the 
consequents of the rules comprise a complete set of events, 
where the relevance of each set is measured by comparing 
its associational support against the reference given by an 
emulated distribution. However, rules within the same set 
may also be compared as shown in FIG. 3. Relative entropy 
is a measure of "expected" information content for discrimi- 
nating between two distributions — i.e., it is an average value 
of a pointwise measure. This pointwise measure can be used 
to compare individual rules within a set of rules. More
precisely, it can be used to compare measures over a set of
rules, given that these measures comprise a probability 
distribution. 

For example, if the rules $A \rightarrow B_1, A \rightarrow B_2, \ldots, A \rightarrow B_n$
contained in the database of step 304 are to be ranked
according to their "surprise" in the sense that the rule
support measured over the actual users is large relative to the
rule support measured over emulated visits, one way (based
upon the relationship $p_P(s)/p_R(s)$ discussed above with
regard to relative entropy) is to sort the quantities in
descending order in step 306:

$$P(A, B_1)/R(A, B_1),\; P(A, B_2)/R(A, B_2),\; \ldots,\; P(A, B_n)/R(A, B_n).$$

Sorting these likelihood ratios is equivalent to traversing
$P_{AB}/R_{AB}$ and looking for places where it deviates significantly
from 1, where $P_{AB} = P(A, B_1), P(A, B_2), \ldots, P(A, B_n)$ and
$R_{AB} = R(A, B_1), R(A, B_2), \ldots, R(A, B_n)$. Peaks (ratios much
greater than 1) show where the support of rules as described
by $P_{AB}$ is significantly greater than that described by $R_{AB}$.
Dips (ratios much less than 1) show where the support under
$P_{AB}$ is unusually lower than what is suggested by $R_{AB}$.
Ratios close to 1 show where the behavior of actual users is
consistent with that of emulated users. The method ends in step
308.
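
The sort of step 306 can be sketched as follows. This is a minimal example under stated assumptions: the rule labels and support values are hypothetical, and every emulated support is assumed to be strictly positive so that the ratio is defined.

```python
def rank_by_likelihood_ratio(actual, emulated):
    """Return (rule, P/R) pairs sorted with the largest ratio first."""
    ratios = {rule: actual[rule] / emulated[rule] for rule in actual}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Hypothetical actual (P) and emulated (R) supports for three rules.
actual = {"A->B1": 0.30, "A->B2": 0.10, "A->B3": 0.02}
emulated = {"A->B1": 0.10, "A->B2": 0.10, "A->B3": 0.08}

ranking = rank_by_likelihood_ratio(actual, emulated)
for rule, ratio in ranking:
    # A peak (ratio well above 1) heads the list; a dip ends it.
    print(rule, round(ratio, 2))
```

Here the first rule is a peak, the second matches emulated behavior, and the third is a dip.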

c. Individual Rules over a Complete System of Events,
Weighted by Support

Sorting rules by their support likelihood as discussed
above with respect to FIG. 3 results in rules with very high
or very low likelihood ratios being distinguished, even if
those rules have negligible support. An appropriate solution
is, after accessing a database in step 404, to sort in step 406
of FIG. 4 the quantities based upon the relationship
$p_P(s)\log(p_P(s)/p_R(s))$ discussed above with regard to relative
entropy:

$$P(A, B_1)\log(P(A, B_1)/R(A, B_1)),\; P(A, B_2)\log(P(A, B_2)/R(A, B_2)),\; \ldots,\; P(A, B_n)\log(P(A, B_n)/R(A, B_n)).$$




The "pointwise" information derived in this sorting
method, given two rules for which the likelihood ratios are
equal, breaks the "tie" by considering the actual support.
Therefore, if using the ratio to detect unusually high support
(sorting in descending order in step 408 and picking rules
that rise to the top in step 410), then the rule with the higher
actual support will prevail in step 412. If using the ratio to
detect unusually low support (sorting in ascending order in
step 414 and picking rules that rise to the top in step 416),
then the rule with the lower actual support will prevail in
step 418. The method ends in step 420.
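
The tie-breaking effect in the descending sort can be sketched as follows. This is an illustrative example only, with hypothetical rule labels and support values, and it assumes strictly positive actual and emulated supports.

```python
import math

def pointwise_information(p, r):
    """Support-weighted log likelihood ratio, p * log(p / r)."""
    return p * math.log(p / r)

# (actual support P, emulated support R). The first two rules share a
# likelihood ratio of about 2 but differ greatly in actual support.
rules = {
    "A->B1": (0.20, 0.10),
    "A->B2": (0.002, 0.001),
    "A->B3": (0.10, 0.10),
}

ranking = sorted(
    rules,
    key=lambda name: pointwise_information(*rules[name]),
    reverse=True,   # descending: unusually high support first
)
print(ranking)
```

The well-supported rule heads the list even though the rarely observed rule has the same likelihood ratio.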

d. Ranking Rule Support over Non-Mutually Exclusive
Choices

If associations are applied to URLs in the database of step
502 of FIG. 5 that do not comprise a complete system of
events in step 504, a user's choice or event may not be
mutually exclusive. If the complete system of events has
occurred, a different method for sorting is applied in step
506, for example, a method as shown in FIGS. 2-4.
However, "incomplete" events may be converted into a
probability distribution by letting event $D_i$ correspond to a
set of URLs, where a head and a body of an association can
refer to a single URL or to a set of URLs in step 508, and
event $\neg D_i$ corresponds to "not $D_i$", i.e., all other URLs
besides those in $D_i$. Therefore, $\{D_1, D_2, \ldots, D_m\}$ can be an
arbitrary set of objects that are not mutually exclusive and
can be examined in step 510 by sorting "m" quantities as
follows:

$$\{(P(A, D_i)/R(A, D_i) + P(A, \neg D_i)/R(A, \neg D_i))/2\},\quad i = 1, 2, \ldots, m. \qquad \text{(A)}$$

This indicates for each i where the observed support for
the two rules $\{A \rightarrow D_i, A \rightarrow \neg D_i\}$ is much different than that
exhibited in the reference distribution, for example, as
provided by a generative model such as the Web Walk
Emulator. In order to give preference to rules having higher
support, relative entropy is applied in step 512, sorting the
following m quantities:

$$\{P(A, D_i)\log(P(A, D_i)/R(A, D_i)) + P(A, \neg D_i)\log(P(A, \neg D_i)/R(A, \neg D_i))\},\quad i = 1, 2, \ldots, m. \qquad \text{(B)}$$

This compares the pairs of rules $\{(A \rightarrow D_1, A \rightarrow \neg D_1)\}$,
$\{(A \rightarrow D_2, A \rightarrow \neg D_2)\}$, ..., $\{(A \rightarrow D_m, A \rightarrow \neg D_m)\}$ with
each other on the basis of whether their support over one
data set is unusually high (respectively, low) as compared
with the support as evaluated over a data set representing a
baseline reference distribution. The method ends in step 514.
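
The two sorts of steps 510 and 512 can be contrasted with a small sketch. It is illustrative only: the set labels and values are hypothetical, each pair (support of the rule into $D_i$, support of the rule into its complement) is assumed to be normalized to sum to one, and the first score is taken to be the plain average of the two likelihood ratios, following the description of (A) as an average likelihood ratio.

```python
import math

def average_likelihood_ratio(p_pair, r_pair):
    """Average of the two likelihood ratios, in the spirit of statement (A)."""
    return sum(p / r for p, r in zip(p_pair, r_pair)) / 2

def pair_relative_entropy(p_pair, r_pair):
    """Relative entropy of the two-event distribution, as in statement (B)."""
    return sum(p * math.log(p / r) for p, r in zip(p_pair, r_pair) if p > 0)

# Hypothetical pairs (share for D_i, share for "not D_i").
actual = {"D1": (0.6, 0.4), "D2": (0.3, 0.7), "D3": (0.05, 0.95)}
emulated = {"D1": (0.3, 0.7), "D2": (0.3, 0.7), "D3": (0.01, 0.99)}

by_a = sorted(
    actual,
    key=lambda d: average_likelihood_ratio(actual[d], emulated[d]),
    reverse=True,
)
by_b = sorted(
    actual,
    key=lambda d: pair_relative_entropy(actual[d], emulated[d]),
    reverse=True,
)
print(by_a)   # ordering by average likelihood ratio
print(by_b)   # ordering by relative entropy
```

In this example the rarely chosen set D3 heads the ratio-based ordering, while the relative entropy ordering prefers D1, whose deviation from the reference is backed by much higher support.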

Both of the sorting method notations expressed in this
section and FIG. 5 are applications of general methods of
converting each rule into a corresponding distribution, and
then using a distributional measure (average likelihood ratio
in (A), and relative entropy in (B)) to compare the resulting
distribution with a baseline reference. However, the following
quantities may be sorted instead as shown in steps 610
and 612 of FIG. 6:

$$\{P(A, D_i)/R(A, D_i)\},\quad i = 1, 2, \ldots, m. \qquad \text{(C)}$$

$$\{P(A, D_i)\log(P(A, D_i)/R(A, D_i))\},\quad i = 1, 2, \ldots, m. \qquad \text{(D)}$$

Statement (C) may be interpreted as seeking out rules that
have unusual levels of support compared to a baseline
reference distribution, regardless of whether or not the rules
are highly supported in the available data as determined by
P. Statement (D) also seeks out rules having unusually high
or low support, but weights them according to their support
over the observed data as determined by P such that given
two rules with identical likelihood ratios, the one with
greater support will be sorted closer to the head (or tail) of
the rank ordering. The method ends in step 614.

2. Ranking Rules by Their Confidence 

In another embodiment, and under the appropriate conditions
(e.g., sufficient data, stationary data generating
process), association rules' measures of "confidence" can be
used in one method to estimate conditional probabilities. In
particular, the confidence of rule $A \rightarrow B$ gives a useable
estimate of the conditional probability $P(A|B)$. The same
techniques as described immediately above for rule support
may be applied to compare the confidence of rules against
a baseline reference distribution. Relationships such as
defined in statements (C) and (D) above are easily applied to
evaluating rule confidence. With substitution of the appropriate
conditional probabilities, the relationships and the
rules are sorted in steps 706 and 708 where:

$$\{P(A|D_i)/R(A|D_i)\},\quad i = 1, 2, \ldots, m. \qquad \text{(E)}$$

$$\{P(A, D_i)\log(P(A|D_i)/R(A|D_i))\},\quad i = 1, 2, \ldots, m. \qquad \text{(F)}$$

Sorting likelihood ratios as described above is equivalent
to traversing the distribution $P_{AB}/R_{AB}$ and looking for
places where it deviates significantly from 1. Peaks (ratios
much greater than 1) show where the confidence of rules
under $P_{AB}$ is significantly greater than what is suggested by
$R_{AB}$, and dips (ratios close to 0) show where the confidence
under $P_{AB}$ is unusually lower than what is suggested by $R_{AB}$.
Comparatively speaking, the interpretation of the relationship
statement (E) is not as tidy because the conditional
probabilities do not in general lend themselves to forming a
probability distribution; for each i, statement (E) simply
delivers a pointwise measure of the information content for
discriminating between the two distributions $\{P(A|D_i), P(A|\neg D_i)\}$
and $\{R(A|D_i), R(A|\neg D_i)\}$. The relationships in
statement (F) add the benefit of giving emphasis to rules
with greater support, which is ideally suited to the data
mining applications for which these techniques are intended.
The method ends in step 710.
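
The confidence-ratio sort of statement (E) can be sketched as follows. This is an illustrative example only: the session data and labels are hypothetical, the conditional probability is estimated as a simple relative frequency following the notation $P(A|D_i)$ used above, and every set $D_i$ is assumed to occur in at least one session with a nonzero emulated confidence.

```python
def conditional(sessions, a, d):
    """Estimate P(a | d): share of sessions containing d that also contain a."""
    with_d = [s for s in sessions if d in s]
    return sum(1 for s in with_d if a in s) / len(with_d)

# Hypothetical actual and emulated sessions, each a set of visited items.
actual = [{"A", "D1"}, {"A", "D1"}, {"D1"}, {"A", "D2"}, {"D2"}, {"D2"}]
emulated = [{"A", "D1"}, {"D1"}, {"D1"}, {"A", "D2"}, {"D2"}, {"D2"}]

ratios = {
    d: conditional(actual, "A", d) / conditional(emulated, "A", d)
    for d in ("D1", "D2")
}
ranking = sorted(ratios, key=ratios.get, reverse=True)
print(ratios)
print(ranking)
```

A ratio near 1 (here D2) indicates confidence consistent with the emulated reference, while the larger ratio (here D1) marks a rule whose confidence among actual visitors exceeds the reference.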
Data Storage Device 

Such methods as discussed above may be implemented,
for example, by operating the processor 103 of the client
computer 102 shown in FIG. 1 to execute a sequence of
machine-readable instructions. These instructions may
reside in various types of data storage media. In this
respect, one aspect of the present invention concerns an
article of manufacture, comprising a data storage medium
tangibly embodying a program of machine-readable instructions
executable by a digital data processor to perform
method steps to extract desired data relationships from web
site data.

This data storage medium may comprise, for example,
RAM contained within the client computer 102.
Alternatively, the instructions may be contained in another
data storage medium, such as a magnetic data storage
diskette 800 (FIG. 8). Whether contained in the client
computer 102 or elsewhere, the instructions may instead be
stored on another type of data storage medium such as
DASD storage (e.g., a conventional "hard drive" or a RAID
array), magnetic tape, electronic read-only memory (e.g.,
CD-ROM or WORM), optical storage device (e.g.,
WORM), paper "punch" cards, or other data storage media.
In an illustrative embodiment of the invention, the machine-readable
instructions may comprise lines of compiled C-type
language code.

Other Embodiments 

While there have been shown what are presently considered
to be preferred embodiments of the invention, it will be
apparent to those skilled in the art that various changes and
modifications can be made herein without departing from
the scope of the invention as defined by the appended
claims.

What is claimed is: 

1. A method for sorting data mining association rules, the 
method comprising: 

identifying statistically significant relationships within a 
cumulated distribution of data, the significant relation- 
ships represented by association rules; and 

separating meaningful association rules from unmeaningful
association rules using an emulated distribution of
the data as a reference, wherein said emulated distri- 
bution is based upon emulated events that are different 
than actual events. 

2. The method recited in claim 1, separating meaningful 
association rules from unmeaningful association rules by 
sorting the rules by their support within the distribution of 
data. 

3. The method recited in claim 1, separating meaningful 
association rules from unmeaningful association rules by 
sorting the rules by their confidence within the distribution 
of data. 

4. The method recited in claim 2, wherein the association 
rules are sorted in sets of rules for different systems of 
events, where C represents an emulated system of events for 
a web site and B represents an actual system of events for the 
same web site data, and where if $K(P_{AC} : R_{AC}) > K(P_{AB} : R_{AB})$,
the association rules applied to the emulated system of
events C have greater relevance than the association rules
applied to the actual system of events B, and where P is a 
probability distribution for an association rule for the actual 
system of events, and R is an emulated support for an 
association rule for the emulated system of events, and 
wherein K is a constant. 

5. The method recited in claim 4, wherein P is measured 
by an association rule's support in a respective system of 
events. 

6. The method recited in claim 5, wherein
$P_{AB} = P(A, B_1), P(A, B_2), \ldots, P(A, B_n)$ and
$R_{AB} = R(A, B_1), R(A, B_2), \ldots, R(A, B_n)$, and
$P_{AC} = P(A, C_1), P(A, C_2), \ldots, P(A, B_n)$.

7. The method recited in claim 2, wherein the association 
rules are sorted in descending order of support. 

8. The method recited in claim 7, where $P(A, B_1)/R(A, B_1)$,
$P(A, B_2)/R(A, B_2), \ldots, P(A, B_n)/R(A, B_n)$, where B represents
an event in an actual system of events, and where the
association rules pertaining to the actual system of events
have equal support as association rules pertaining to an
emulated system of events if $P_{AB}/R_{AB} = 1$, where P is a
probability distribution for an association rule for the actual
system of events, and R is an emulated support for an
association rule for the emulated system of events, and
where $P_{AB}$ has lesser support than $R_{AB}$ as $P_{AB} \rightarrow 0$ and
$R_{AB} \rightarrow \infty$, and where $P_{AB}$ has greater support than $R_{AB}$ as
$R_{AB} \rightarrow 0$ and $P_{AB} \rightarrow \infty$.

9. The method recited in claim 2, wherein the association 
rules are sorted in ascending order of support. 

10. The method recited in claim 9, where $R(A, B_1)/P(A, B_1)$,
$R(A, B_2)/P(A, B_2), \ldots, R(A, B_n)/P(A, B_n)$, where B represents
an actual system of events, and where the association
rules pertaining to the actual system of events have equal
support as the association rules pertaining to an emulated
system of events if $R_{AB}/P_{AB} = 1$, where R is an emulated
support for an association rule for an emulated system of
events, and P is a probability distribution for an association
rule for the actual system of events, and where $R_{AB}$ has
lesser support than $P_{AB}$ as $R_{AB} \rightarrow 0$ and $P_{AB} \rightarrow \infty$, and where
$R_{AB}$ has greater support than $P_{AB}$ as $P_{AB} \rightarrow 0$ and $R_{AB} \rightarrow \infty$.

11. The method recited in claim 2, wherein the association
rules are further sorted where $P(A, B_1)\log(P(A, B_1)/R(A, B_1))$,
$P(A, B_2)\log(P(A, B_2)/R(A, B_2)), \ldots, P(A, B_n)\log(P(A, B_n)/R(A, B_n))$.

12. The method recited in claim 8, wherein the association
rules are further sorted where $P(A, B_1)\log(P(A, B_1)/R(A, B_1))$,
$P(A, B_2)\log(P(A, B_2)/R(A, B_2)), \ldots, P(A, B_n)\log(P(A, B_n)/R(A, B_n))$.

13. The method recited in claim 10, wherein the association
rules are further sorted where $P(A, B_1)\log(P(A, B_1)/R(A, B_1))$,
$P(A, B_2)\log(P(A, B_2)/R(A, B_2)), \ldots, P(A, B_n)\log(P(A, B_n)/R(A, B_n))$.

14. The method recited in claim 2, wherein event $D_i$
corresponds to a set of uniform resource locator data, and
event $\neg D_i$ corresponds to all other sets of uniform resource
locators not in set $D_i$, and where $D = \{D_1, D_2, \ldots, D_m\}$,
where the uniform resource locator data sets do not comprise
a system of events and
$\{(P(A, D_i)/R(A, D_i) + P(A, \neg D_i)/R(A, \neg D_i))/2\}$, where $i = 1, 2, \ldots, m$.

15. The method recited in claim 14, sorting further
comprising
$\{P(A, D_i)\log(P(A, D_i)/R(A, D_i)) + P(A, \neg D_i)\log(P(A, \neg D_i)/R(A, \neg D_i))\}$, $i = 1, 2, \ldots, m$,
$\{P(A, D_i)/R(A, D_i)\}$, $i = 1, 2, \ldots, m$.

16. The method recited in claim 2, wherein association 
rules having high levels of support compared to the emu- 
lated distribution are ranked highest, regardless of whether 
the association rules are highly supported in the uniform 
resource locator data as determined by P, and where P is a 
probability of an occurrence of the association rule. 

17. The method recited in claim 16, wherein two asso- 
ciation rules with identical P values are further sorted so that 
the rule with greater support in the emulated data is sorted 
higher than the rule with the lesser support. 

18. A method for sorting data mining association rules, the 
method comprising: 

identifying statistically significant relationships within a 
cumulated distribution of data, the significant relation- 
ships represented by association rules; and 

separating meaningful association rules from unmeaning- 
ful association rules using an emulated distribution of 
the data as a reference by sorting the rules by their 
support within the distribution of data, 

wherein the uniform resource locator data does not comprise
a system of events and is sorted by m sets of
uniform resource locator data, where $\{P(A, D_i)/R(A, D_i)\}$,
$i = 1, 2, \ldots, m$, and $D_i$ corresponds to sets of
uniform resource locator data 1 to m.

19. The method recited in claim 18, wherein the association
rules are further sorted where
$\{P(A, D_i)\log(P(A, D_i)/R(A, D_i))\}$, $i = 1, 2, \ldots, m$.

20. A method for sorting data mining association rules, the 
method comprising: 

identifying statistically significant relationships within a 
cumulated distribution of data, the significant relation- 
ships represented by association rules; and 

separating meaningful association rules from unmeaning- 
ful association rules using an emulated distribution of 
the data as a reference by sorting the rules by their 
confidence within the distribution of data, 

sorting of the association rules comprising ranking the
rules by their confidence, where $\{P(A|D_i)/R(A|D_i)\}$,
$i = 1, 2, \ldots, m$, where P is a probability of an occurrence
of an association rule, and where two association rules
with identical P values are further sorted so that the rule
with greater support in the emulated data is sorted
higher than the rule with the lesser support.

21. The method recited in claim 20, wherein the association
rules are further sorted where
$\{P(A, D_i)\log(P(A|D_i)/R(A|D_i))\}$, $i = 1, 2, \ldots, m$.

22. An article of manufacture comprising a data storage 
medium tangibly embodying a program of machine- 
readable instructions executable by a digital processing 
apparatus to perform a method for sorting data mining
association rules, the method comprising:

identifying statistically significant relationships within a
cumulated distribution of uniform resource locator
data, the significant relationships represented by association
rules; and

separating meaningful association rules from unmeaningful
association rules using an emulated distribution of
the uniform resource locator data as a reference,
wherein said emulated distribution is based upon emulated
events that are different than actual events.

23. The article recited in claim 22, separating meaningful
association rules from unmeaningful association rules by
sorting the rules by their support within the distribution of
uniform resource locator data.

24. The article recited in claim 22, separating meaningful
association rules from unmeaningful association rules by
sorting the rules by their confidence for support within the
distribution of uniform resource locator data.

25. The article recited in claim 22, wherein the association
rules are sorted in sets of rules for different systems of
events, where C represents an emulated system of events for
a web site and B represents an actual system of events for the
same web site data, and where if $K(P_{AC} : R_{AC}) > K(P_{AB} : R_{AB})$,
the association rules applied to the emulated system of
events C have greater relevance than the association rules
applied to the actual system of events B, and where P is a
probability distribution for an association rule for the actual
system of events, and R is an emulated support for an
association rule for the emulated system of events, and
wherein K is a constant.

26. The article recited in claim 25, wherein P is measured
by an association rule's support in a respective system of
events.

27. The article recited in claim 22, wherein
$P_{AB} = P(A, B_1), P(A, B_2), \ldots, P(A, B_n)$ and
$R_{AB} = R(A, B_1), R(A, B_2), \ldots, R(A, B_n)$, and
$P_{AC} = P(A, C_1), P(A, C_2), \ldots, P(A, B_n)$.

28. The article recited in claim 22, wherein the association
rules are sorted in descending order of support.

29. The article recited in claim 22, where $P(A, B_1)/R(A, B_1)$,
$P(A, B_2)/R(A, B_2), \ldots, P(A, B_n)/R(A, B_n)$, where B represents
an event in an actual system of events, and where the
association rules pertaining to the actual system of events
have equal support as association rules pertaining to an
emulated system of events if $P_{AB}/R_{AB} = 1$, where P is a
probability distribution for an association rule for the actual
system of events, and R is an emulated support for an
association rule for the emulated system of events, and
where $P_{AB}$ has lesser support than $R_{AB}$ as $P_{AB} \rightarrow 0$ and
$R_{AB} \rightarrow \infty$, and where $P_{AB}$ has greater support than $R_{AB}$ as
$R_{AB} \rightarrow 0$ and $P_{AB} \rightarrow \infty$.

30. The article recited in claim 22, wherein the association 
rules are sorted in ascending order of support. 

31. The article recited in claim 22, where $R(A, B_1)/P(A, B_1)$,
$R(A, B_2)/P(A, B_2), \ldots, R(A, B_n)/P(A, B_n)$, where B represents
an actual system of events, and where the association
rules pertaining to the actual system of events have equal
support as the association rules pertaining to an emulated
system of events if $R_{AB}/P_{AB} = 1$, where R is an emulated
support for an association rule for an emulated system of
events, and P is a probability distribution for an association
rule for the actual system of events, and where $R_{AB}$ has
lesser support than $P_{AB}$ as $R_{AB} \rightarrow 0$ and $P_{AB} \rightarrow \infty$, and where
$R_{AB}$ has greater support than $P_{AB}$ as $P_{AB} \rightarrow 0$ and $R_{AB} \rightarrow \infty$.

32. The article recited in claim 22, wherein the association
rules are further sorted where $P(A, B_1)\log(P(A, B_1)/R(A, B_1))$,
$P(A, B_2)\log(P(A, B_2)/R(A, B_2)), \ldots, P(A, B_n)\log(P(A, B_n)/R(A, B_n))$.

33. An article recited in claim 29, wherein the association
rules are further sorted where $P(A, B_1)\log(P(A, B_1)/R(A, B_1))$,
$P(A, B_2)\log(P(A, B_2)/R(A, B_2)), \ldots, P(A, B_n)\log(P(A, B_n)/R(A, B_n))$.

34. An article recited in claim 31, wherein the association
rules are further sorted where $P(A, B_1)\log(P(A, B_1)/R(A, B_1))$,
$P(A, B_2)\log(P(A, B_2)/R(A, B_2)), \ldots, P(A, B_n)\log(P(A, B_n)/R(A, B_n))$.

35. The article recited in claim 22, wherein event $D_i$
corresponds to a set of uniform resource locator data, and
event $\neg D_i$ corresponds to all other sets of uniform resource
locators not in set $D_i$, and where $D = \{D_1, D_2, \ldots, D_m\}$,
where the uniform resource locator data sets do not comprise
a system of events and
$\{(P(A, D_i)/R(A, D_i) + P(A, \neg D_i)/R(A, \neg D_i))/2\}$, where $i = 1, 2, \ldots, m$.

36. The article recited in claim 22, the method steps
further comprising:
$\{P(A, D_i)\log(P(A, D_i)/R(A, D_i)) + P(A, \neg D_i)\log(P(A, \neg D_i)/R(A, \neg D_i))\}$, $i = 1, 2, \ldots, m$,
$\{P(A, D_i)/R(A, D_i)\}$, $i = 1, 2, \ldots, m$.

37. The article recited in claim 22, wherein association 15 
rules having high levels of support compared to the emu- 
lated distribution are ranked highest, regardless of whether 
the association rules are highly supported in the uniform 
resource locator data as determined by P, and where P is a 
probability of an occurrence of the association rule.

38. The article recited in claim 22, wherein two associa- 
tion rules with identical P values are further sorted so that the 
rule with greater support in the emulated data is sorted 
higher than the rule with the lesser support. 
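The tie-breaking behaviour described in this claim can be sketched as a two-part sort key. The sketch assumes that P itself is the primary key and that both keys sort in descending order; the function name `sort_with_tie_break`, the tuple layout, and the sample values are hypothetical.

```python
def sort_with_tie_break(rules):
    """Sort (name, p, r) tuples by p descending.

    Rules with identical p are ordered so that the one with the greater
    emulated support r comes first.
    """
    return sorted(rules, key=lambda rule: (rule[1], rule[2]), reverse=True)

# Hypothetical values: rule1 and rule2 share the same P but differ in R.
rules = [("rule1", 0.30, 0.05), ("rule2", 0.30, 0.12), ("rule3", 0.40, 0.01)]
ordered = sort_with_tie_break(rules)
```

Here `rule2` is placed above `rule1` because both have the same P value and `rule2` has the greater support in the emulated data.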

39. An article of manufacture comprising a data storage
medium tangibly embodying a program of machine- 
readable instructions executable by a digital processing 
apparatus to perform a method for sorting data mining 
association rules, the method comprising: 

identifying statistically significant relationships within a
cumulated distribution of uniform resource locator 
data, the significant relationships represented by asso- 
ciation rules; and 

separating meaningful association rules from unmeaning-
ful association rules using an emulated distribution of
the uniform resource locator data as a reference, 
wherein said emulated distribution is based upon emu- 
lated events that are different than actual events; 

wherein the uniform resource locator data does not com-
prise a system of events and is sorted by m sets of
uniform resource locator data, where $\{P(A,D_i)/R(A,D_i)\}$,
i=1, 2, . . . , m, and $D_i$ corresponds to sets of
uniform resource locator data 1 to m.
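A minimal sketch of the ratio $P(A,D_i)/R(A,D_i)$ evaluated over m sets of URL data follows. The function name `support_ratios`, the dictionary layout keyed by set name, and the sample supports are assumptions for illustration; the claim does not prescribe any data structure.

```python
def support_ratios(p_by_set, r_by_set):
    """Return {set_name: P(A,D_i) / R(A,D_i)} for every URL data set D_i.

    p_by_set: actual support of the rule within each set.
    r_by_set: emulated support of the rule within each set (assumed > 0).
    """
    ratios = {}
    for name, p in p_by_set.items():
        r = r_by_set[name]
        if r <= 0:
            raise ValueError(f"emulated support for {name} must be positive")
        ratios[name] = p / r
    return ratios

# Hypothetical supports for m = 2 sets.
actual = {"D1": 0.25, "D2": 0.125}
emulated = {"D1": 0.125, "D2": 0.5}
ratios = support_ratios(actual, emulated)
# Set names ordered by ratio, largest first (assumed order).
by_ratio = sorted(ratios, key=ratios.get, reverse=True)
```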

40. The article recited in claim 22, wherein the association
rules are further sorted where $\{P(A,D_i)\log(P(A,D_i)/R(A,D_i))\}$,
i=1, 2, . . . , m.

41. An article of manufacture comprising a data storage 
medium tangibly embodying a program of machine- 
readable instructions executable by a digital processing 
apparatus to perform a method for sorting data mining
association rules, the method comprising: 

identifying statistically significant relationships within a 
cumulated distribution of uniform resource locator 
data, the significant relationships represented by asso-
ciation rules; and

separating meaningful association rules from unmeaning-
ful association rules using an emulated distribution of
the uniform resource locator data as a reference,
wherein said emulated distribution is based upon emu-
lated events that are different than actual events;

sorting of the association rules comprising ranking the
rules by their confidence, where $\{P(A \mid D_i)/R(A \mid D_i)\}$,
i=1, 2, . . . , m, where P is a probability of an occurrence
of an association rule, and where two association rules
with identical P values are further sorted so that the rule
with greater support in the emulated data is sorted
higher than the rule with the lesser support.
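The confidence ratio in this claim can be illustrated with raw counts, assuming that confidence means the conditional probability $P(A \mid D_i)$ estimated as count(A and $D_i$) divided by count($D_i$). That estimator, the function names `confidence` and `confidence_ratio`, and the sample counts are assumptions of this sketch, not statements from the claim.

```python
def confidence(joint_count, condition_count):
    """Estimate a conditional probability as count(A and D_i) / count(D_i)."""
    if condition_count <= 0:
        raise ValueError("condition_count must be positive")
    return joint_count / condition_count

def confidence_ratio(actual, emulated):
    """Return P(A|D_i) / R(A|D_i) from two (joint_count, condition_count) pairs."""
    r = confidence(*emulated)
    if r == 0:
        raise ValueError("emulated confidence must be non-zero")
    return confidence(*actual) / r

# Hypothetical counts: 30 of 40 actual sessions vs. 10 of 40 emulated sessions.
ratio = confidence_ratio(actual=(30, 40), emulated=(10, 40))
```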



42. The article recited in claim 22, wherein the association
rules are further sorted where $\{P(A,D_i)\log(P(A \mid D_i)/R(A \mid D_i))\}$,
i=1, 2, . . . , m.

43. An apparatus to sort data mining association rules, 
comprising: 

a processor; 

a database including URL data; 

circuitry to communicatively couple the processor to the 
database; 

storage communicatively accessible by the processor, the 
processor sorting mining association rules by: 
identifying statistically significant relationships within 
a cumulated distribution of uniform resource locator 
data, the significant relationships represented by 
association rules; and 
separating meaningful association rules from unmean-
ingful association rules using an emulated distribu-
tion of the uniform resource locator data as a
reference, wherein said emulated distribution is 
based upon emulated events that are different than 
actual events. 

44. The apparatus recited in claim 43, the processor 
further sorting by: separating meaningful association rules 
from unmeaningful association rules by sorting the rules by 
their support within the distribution of uniform resource 
locator data. 

45. The apparatus recited in claim 43, separating mean- 
ingful association rules from unmeaningful association rules 
by sorting the rules by their confidence for support within 
the distribution of uniform resource locator data. 

46. The apparatus recited in claim 43, wherein the asso- 
ciation rules are sorted in sets of rules for different systems 
of events, where C represents an emulated system of events 
for a web site and B represents an actual system of events for 
the same web site data, and where if $K(P_{AC}:R_{AC})>K(P_{AB}:R_{AB})$,
the association rules applied to the emulated
system of events C have greater relevance than the associa-
tion rules applied to the actual system of events B, and where
P is a probability distribution for an association rule for the 
actual system of events, and R is an emulated support for an 
association rule for the emulated system of events, and 
wherein K is a constant. 

47. The apparatus recited in claim 43, wherein P is 
measured by an association rule's support in a respective 
system of events. 

48. The apparatus recited in claim 43, wherein $P_{AB}=P(A,B_1)$,
$P(A,B_2)$, . . . $P(A,B_n)$ and $R_{AB}=R(A,B_1)$, $R(A,B_2)$ . . .
$R(A,B_n)$, and $P_{AC}=P(A,C_1)$, $P(A,C_2)$, . . . $P(A,B_n)$.

49. The apparatus recited in claim 43, wherein the asso- 
ciation rules are sorted in descending order of support. 

50. The apparatus recited in claim 43, where $P(A,B_1)/R(A,B_1)$,
$P(A,B_2)/R(A,B_2)$, . . . $P(A,B_n)/R(A,B_n)$, where B
represents an event in an actual system of events, and where
the association rules pertaining to the actual system of
events have equal support as association rules pertaining to
an emulated system of events if $P_{AB}/R_{AB}=1$, where P is a
probability distribution for an association rule for the actual
system of events, and R is an emulated support for an
association rule for the emulated system of events, and
where $P_{AB}$ has lesser support than $R_{AB}$ as $P_{AB}\to 0$ and
$R_{AB}\to\infty$, and where $P_{AB}$ has greater support than $R_{AB}$ as
$R_{AB}\to 0$ and $P_{AB}\to\infty$.
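The reading of the ratio $P_{AB}/R_{AB}$ in this claim can be sketched as a small comparison helper. The claim itself describes only the case of a ratio equal to 1 and the two limiting cases; treating every ratio below 1 as "lesser" and every ratio above 1 as "greater" is a simplifying assumption, as are the function name `compare_support` and the tolerance parameter.

```python
import math

def compare_support(p_ab, r_ab, rel_tol=1e-9):
    """Compare actual support p_ab with emulated support r_ab via their ratio.

    Returns "equal" when p_ab / r_ab is 1 (within rel_tol), "lesser" when the
    ratio is below 1, and "greater" when it is above 1.
    """
    if p_ab < 0 or r_ab <= 0:
        raise ValueError("p_ab must be non-negative and r_ab positive")
    ratio = p_ab / r_ab
    if math.isclose(ratio, 1.0, rel_tol=rel_tol):
        return "equal"
    return "lesser" if ratio < 1.0 else "greater"

# Hypothetical supports, for illustration only.
same = compare_support(0.2, 0.2)
under = compare_support(0.01, 0.5)
over = compare_support(0.5, 0.01)
```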

51. The apparatus recited in claim 43, wherein the asso- 
ciation rules are sorted in ascending order of support. 

52. The apparatus recited in claim 43, where $R(A,B_1)/P(A,B_1)$,
$R(A,B_2)/P(A,B_2)$, . . . $R(A,B_n)/P(A,B_n)$, where B
represents an actual system of events, and where the asso-
ciation rules pertaining to the actual system of events have
equal support as the association rules pertaining to an
emulated system of events if $R_{AB}/P_{AB}=1$ where R is an
emulated support for an association rule for an emulated
system of events, and P is a probability distribution for an
association rule for the actual system of events, and where
$R_{AB}$ has lesser support than $P_{AB}$ as $R_{AB}\to 0$ and $P_{AB}\to\infty$, and
where $R_{AB}$ has greater support than $P_{AB}$ as $P_{AB}\to 0$ and
$R_{AB}\to\infty$.

53. The apparatus recited in claim 43, wherein the asso-
ciation rules are further sorted where $P(A,B_1)\log(P(A,B_1)/R(A,B_1))$,
$P(A,B_2)\log(P(A,B_2)/R(A,B_2))$, . . . , $P(A,B_n)\log(P(A,B_n)/R(A,B_n))$.

54. The apparatus recited in claim 50, wherein the asso-
ciation rules are further sorted where $P(A,B_1)\log(P(A,B_1)/R(A,B_1))$,
$P(A,B_2)\log(P(A,B_2)/R(A,B_2))$, . . . , $P(A,B_n)\log(P(A,B_n)/R(A,B_n))$.

55. The apparatus recited in claim 52, wherein the asso-
ciation rules are further sorted where $P(A,B_1)\log(P(A,B_1)/R(A,B_1))$,
$P(A,B_2)\log(P(A,B_2)/R(A,B_2))$, . . . , $P(A,B_n)\log(P(A,B_n)/R(A,B_n))$.

56. The apparatus recited in claim 43, wherein event $D_1$
corresponds to a set of uniform resource locator data, and
event $\neg D_1$ corresponds to all other sets of uniform resource
locators not in set $D_1$, and where $D=\{D_1, D_2, \ldots, D_m\}$,
where the uniform resource locator data sets do not comprise
a system of events and $\{(P(A,D_i)/R(A,D_i)+P(A,\neg D_i)/R(A,\neg D_i))/2\}$,
i=1, 2, . . . , m, where i=1, 2, . . . , m.

57. The apparatus recited in claim 43, sorting further
comprising $\{P(A,D_i)\log(P(A,D_i)/R(A,D_i))+P(A,\neg D_i)\log(P(A,\neg D_i)/R(A,\neg D_i))\}$,
i=1, 2, . . . , m, $\{P(A,D_i)/R(A,D_i)\}$, i=1, 2, . . . , m.
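The two-term expression in this claim, which combines a set $D_i$ with its complement $\neg D_i$, might be evaluated as sketched below. The function name `two_sided_term`, the argument names, the natural logarithm, and the sample supports are assumptions for illustration; all four supports are assumed strictly positive.

```python
import math

def two_sided_term(p_in, r_in, p_out, r_out):
    """Return P(A,D_i)log(P(A,D_i)/R(A,D_i)) + P(A,~D_i)log(P(A,~D_i)/R(A,~D_i)).

    p_in, r_in: actual and emulated support of the rule within set D_i.
    p_out, r_out: actual and emulated support within the complement of D_i.
    """
    if min(p_in, r_in, p_out, r_out) <= 0:
        raise ValueError("all supports must be positive")
    return p_in * math.log(p_in / r_in) + p_out * math.log(p_out / r_out)

# Hypothetical supports: identical distributions give zero, differing ones do not.
unchanged = two_sided_term(0.2, 0.2, 0.3, 0.3)
shifted = two_sided_term(0.4, 0.1, 0.1, 0.4)
```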

58. The apparatus recited in claim 43, wherein association
rules having high levels of support compared to the emu-
lated baseline reference distribution are ranked highest,
regardless of whether the association rules are highly sup-
ported in the uniform resource locator data as determined by
P, and where P is a probability of an occurrence of the
association rule.

59. The apparatus recited in claim 43, wherein two 
association rules with identical P values are further sorted so 
that the rule with greater support in the emulated data is 
sorted higher than the rule with the lesser support. 

60. An apparatus to sort data mining association rules,
comprising: 

a processor; 

a database including URL data; 

circuitry to communicatively couple the processor to the
database;

storage communicatively accessible by the processor; the 
processor sorting mining association rules by: 
identifying statistically significant relationships within 
a cumulated distribution of uniform resource locator 
data, the significant relationships represented by
association rules; and 
separating meaningful association rules from unmean- 
ingful association rules using an emulated distribu- 
tion of the uniform resource locator data as a 
reference, wherein said emulated distribution is
based upon emulated events that are different than 
actual events; 

where the uniform resource locator data does not com-
prise a system of events and is sorted by m sets of
uniform resource locator data, where $\{P(A,D_i)/R(A,D_i)\}$,
i=1, 2, . . . , m, and $D_i$ corresponds to sets of
uniform resource data 1 to m.

61. The apparatus recited in claim 43, wherein the asso-
ciation rules are further sorted where $\{P(A,D_i)\log(P(A,D_i)/R(A,D_i))\}$,
i=1, 2, . . . , m.

62. An apparatus to sort data mining association rules, 
comprising: 



a processor; 

a database including URL data; 

circuitry to communicatively couple the processor to the 
database; 

storage communicatively accessible by the processor; the 
processor sorting mining association rules by: 
identifying statistically significant relationships within 
a cumulated distribution of uniform resource locator 
data, the significant relationships represented by 
association rules; and 
separating meaningful association rules from unmean- 
ingful association rules using an emulated distribu- 
tion of the uniform resource locator data as a 
reference, wherein said emulated distribution is 
based upon emulated events that are different than 
actual events; 

sorting of the association rules comprising ranking the 
rules by their confidence, where $\{P(A \mid D_i)/R(A \mid D_i)\}$,
i=1, 2, . . . , m, where P is a probability of an occurrence
of an association rule, and where two association rules 
with identical P values are further sorted so that the rule
with greater support in the emulated data is sorted
higher than the rule with the lesser support.

63. The apparatus recited in claim 43, wherein the asso-
ciation rules are further sorted where $\{P(A,D_i)\log(P(A \mid D_i)/R(A \mid D_i))\}$,
i=1, 2, . . . , m.

64. An apparatus for sorting data mining association rules, 
comprising: 

means for storing URL data; 

means for processing the URL data by: 

identifying statistically significant relationships within 
a cumulated distribution of uniform resource locator 
data, the significant relationships represented by 
association rules; and 
separating meaningful association rules from unmean- 
ingful association rules using an emulated distribu- 
tion of the uniform resource locator data as a 
reference, wherein said emulated distribution is 
based upon emulated events that are different than 
actual events. 

65. The apparatus recited in claim 64, the processing 
means further sorting data by separating meaningful asso- 
ciation rules from unmeaningful association rules by sorting 
the rules by their support within the distribution of uniform 
resource locator data. 

66. The apparatus recited in claim 64, the processing 
means further sorting data by separating meaningful asso- 
ciation rules from unmeaningful association rules by sorting 
the rules by their confidence for support within the distri- 
bution of uniform resource locator data. 

67. The apparatus recited in claim 64, wherein the asso- 
ciation rules are sorted in sets of rules for different systems 
of events, where C represents an emulated system of events 
for a web site and B represents an actual system of events for 
the same web site data, and where if $K(P_{AC}:R_{AC})>K(P_{AB}:R_{AB})$,
the association rules applied to the emulated
system of events C have greater relevance than the associa-
tion rules applied to the actual system of events B, and where
P is a probability distribution for an association rule for the 
actual system of events, and R is an emulated support for an 
association rule for the emulated system of events, and 
wherein K is a constant. 

68. The apparatus recited in claim 64, wherein P is 
measured by an association rule's support in a respective 
system of events. 

69. The apparatus recited in claim 64, wherein $P_{AB}=P(A,B_1)$,
$P(A,B_2)$, . . . $P(A,B_n)$ and $R_{AB}=R(A,B_1)$, $R(A,B_2)$ . . .
$R(A,B_n)$, and $P_{AC}=P(A,C_1)$, $P(A,C_2)$, . . . $P(A,B_n)$.